Text mining Lucene storage

The actual content of the text mining annotations is stored in high-performance Lucene indexes. This decision was taken when the initial approach, based purely on a relational database, reached a point where the overall performance and maintainability of the database were no longer satisfactory.

As mentioned in a previous chapter, each document is represented by several n-dimensional vectors. This representation is convenient for further processing of the document. However, it is not easily stored and manipulated in a relational database. One of the problems with this representation is the amount of data that would have to be stored as database rows. We therefore adopted another approach, and the document vectors are stored in a Lucene index. Each document vector consists of numeric values, so a simple "WhitespaceAnalyzer" (org.apache.lucene.analysis.WhitespaceAnalyzer), which splits text on whitespace only, is used for storing and searching in the indexes. The performance of storing and retrieving document information stays constant at a scale of 1'000'000 documents with an average of 23'000 tokens per document.

Using Lucene as the data source allows us to execute queries such as the following efficiently:

  1. finding documents similar to a selected one - in this case we use a modified version of the "MoreLikeThis" (org.apache.lucene.search.similar.MoreLikeThisQuery) Lucene query;
  2. finding documents containing one or more common noun phrases or named entities;
  3. finding the most important concepts (noun phrases) for a given document.
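The first two kinds of query can be illustrated with a small, library-independent sketch. It is not the Lucene implementation: the in-memory inverted index only imitates what the index does internally, and the Jaccard overlap is a simple stand-in for the modified "MoreLikeThis" scoring, whose details are not described here. The uids, the ids and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: uid -> whitespace-separated noun-phrase ids,
# mirroring the way a vector field is stored as text.
NPS_FIELD = {
    "doc-1": "12 45 45 78",
    "doc-2": "45 78 90",
    "doc-3": "7 8 9",
}

def build_inverted_index(field_by_uid):
    """Map each annotation id (as a token) to the uids that contain it."""
    index = defaultdict(set)
    for uid, text in field_by_uid.items():
        for token in text.split():  # split on whitespace only
            index[token].add(uid)
    return index

def documents_sharing(index, ids):
    """Return the uids of documents that contain every id in `ids`."""
    postings = [index.get(str(i), set()) for i in ids]
    return set.intersection(*postings) if postings else set()

def most_similar(field_by_uid, uid):
    """Rank the other documents by Jaccard overlap of their id sets.

    Assumption: Jaccard is used here only for illustration; it is not
    the scoring applied by Lucene's MoreLikeThis.
    """
    target = set(field_by_uid[uid].split())
    scores = []
    for other, text in field_by_uid.items():
        if other == uid:
            continue
        ids = set(text.split())
        union = target | ids
        score = len(target & ids) / len(union) if union else 0.0
        scores.append((score, other))
    # Highest score first; ties broken by uid for a stable order.
    return [other for score, other in sorted(scores, key=lambda s: (-s[0], s[1]))]

index = build_inverted_index(NPS_FIELD)
shared = documents_sharing(index, [45, 78])
ranking = most_similar(NPS_FIELD, "doc-1")
```

In this sketch `shared` holds the documents that contain both noun phrases 45 and 78, and `ranking` lists the remaining documents from most to least similar to doc-1.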

Each document (vector) contains two mandatory fields:

  1. uid - the document identifier;
  2. type - the type of the vector - this could be tokens, lemmas, nps, heads, nes or ne_*, where "*" is one of the named entity types (location, date, person, etc.).

The remaining fields depend on the type of the document vector. For example, the vector of tokens of a document additionally contains the "tokens" and "lemmas" fields; the vector for the noun phrases contains the "nps" and "heads" fields.

Each of the additional document fields contains whitespace-separated integer numbers. The numbers correspond to identifiers (ids) of annotations stored in the relational text mining database.
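The record layout described above can be sketched as a plain dictionary. This is a minimal illustration under stated assumptions, not the actual indexing code: the helper names, the "doc-42" uid and the id values are hypothetical, and the format of real uids is not specified in this chapter.

```python
import re

# Vector types listed in the text; "ne_*" covers the per-entity-type vectors.
VECTOR_TYPE = re.compile(r"^(tokens|lemmas|nps|heads|nes|ne_\w+)$")

def encode_ids(ids):
    """Join integer annotation ids into one whitespace-separated string."""
    return " ".join(str(i) for i in ids)

def decode_ids(text):
    """Split a stored field on whitespace and convert each token back to int."""
    return [int(token) for token in text.split()]

def make_vector_record(uid, vector_type, **id_fields):
    """Build one record: the two mandatory fields plus type-specific ones."""
    if not VECTOR_TYPE.match(vector_type):
        raise ValueError(f"unknown vector type: {vector_type!r}")
    record = {"uid": uid, "type": vector_type}
    for name, ids in id_fields.items():
        record[name] = encode_ids(ids)
    return record

# Noun-phrase vector of a hypothetical document: "nps" and "heads" fields.
record = make_vector_record("doc-42", "nps", nps=[101, 102, 250], heads=[7, 9, 31])
head_ids = decode_ids(record["heads"])
```

The decoded ids in `head_ids` are what would be looked up in the relational text mining database to obtain the annotations themselves.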